Statistical Natural Language Processing Method for Variant Texts Segmentation
Abstract
Techniques already exist for automatically subdividing texts into multi-paragraph subtopic passages, such as the TextTiling method proposed by Hearst. However, an additional algorithm is needed to perform a similar task for parallel or variant texts, because ambiguous and complicated traces of cross-citation among them often produce sinuous patterns of lexical co-occurrence that blur the boundaries of coherent episodes. In other words, we are confronted with a kind of Frame question: how to partition the texts so as to respect their own genealogy and avoid irrelevant interpretations of source references. In this paper, we propose a new statistical natural language processing method for partitioning variant texts. The Parallel Synoptic Tables (PST) of the Synoptic Gospels (Matthew, Mark and Luke) are taken as examples of variant texts to which the new method is applied. The method yields Computed Synoptic Tables (CST), which provide new, objective segmentations of the parallel texts in the Synoptic Gospels.
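As a rough illustration of the lexical co-occurrence idea behind TextTiling-style segmentation, the sketch below compares the bags of words on either side of each sentence gap and reports the gap where cohesion is weakest. It is a simplified illustration only, not the method proposed in this paper and not Hearst's full algorithm (which also uses smoothing and depth scores); the function names, the `block_size` parameter, and the sample sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, block_size=2):
    """Similarity between the sentence blocks on either side of each gap."""
    bags = [Counter(tokenize(s)) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        # Merge up to block_size sentences before and after the gap.
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        sims.append(cosine(left, right))
    return sims


def lowest_gap(sentences, block_size=2):
    """Index i such that cohesion is weakest between sentence i-1 and sentence i."""
    sims = gap_similarities(sentences, block_size)
    return sims.index(min(sims)) + 1


# Hypothetical sample text with a topic shift after the second sentence.
sample = [
    "the boat sailed on the lake",
    "the boat crossed the lake at dawn",
    "bread and fish were shared",
    "the bread and fish fed the crowd",
]
boundary = lowest_gap(sample)
```

With parallel or variant texts, shared wording across passages can flatten these similarity valleys, which is the difficulty the abstract describes.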
Similar resources
Do We Need Chinese Word Segmentation for Statistical Machine Translation?
In Chinese texts, words are not separated by white spaces. This is problematic for many natural language processing tasks. The standard approach is to segment the Chinese character sequence into words. Here, we investigate Chinese word segmentation for statistical machine translation. We pursue two goals: the first one is the maximization of the final translation quality; the second is the mini...
Accurate Word Segmentation and POS Tagging for Japanese Microblogs: Corpus Annotation and Joint Modeling with Lexical Normalization
Microblogs have recently received widespread interest from NLP researchers. However, current tools for Japanese word segmentation and POS tagging still perform poorly on microblog texts. We developed an annotated corpus and proposed a joint model for overcoming this situation. Our annotated corpus of microblog texts enables not only training of accurate statistical models but also quantitative ...
Normalized Accessor Variety Combined with Conditional Random Fields in Chinese Word Segmentation
The word is the basic unit in natural language processing (NLP), as it is at the lexical level upon which further processing rests. The lack of word delimiters such as spaces in Chinese texts makes Chinese word segmentation (CWS) an interesting yet challenging issue. This paper describes the in-depth research following our participation in the fourth International Chinese Language Processing ...
A Topic Segmentation of Texts based on Semantic Domains
LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France. Email: [ferret,grau]@limsi.fr. Abstract: Thematic analysis is essential for many Natural Language Processing (NLP) applications, such as text summarization or information extraction. It is a two-dimensional process that has both to delimit the thematic segments of a text and to identify the topic of each of them. The system we present possesses th...
Mostly-unsupervised statistical segmentation of Japanese kanji sequences
Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introdu...